Efficient Off-line and On-line Algorithms for Training Conditional Random Fields by Approximating Jacobians
Abstract
Previously, algorithms for training conditional random fields (CRFs) were derived from the second-order Taylor expansion of the log-likelihood of the training data, and the key issue is to approximate the Hessian matrix, which usually requires O(n²) computations per iteration. In this paper, we show that efficient off-line and on-line algorithms for training CRFs can also be derived by approximating the Jacobian of the generalized iterative scaling (GIS) mapping, reducing the per-iteration complexity to O(n). Experimental results show that, in terms of rate of convergence, algorithms derived in this way can match or even outperform the best-performing second-order algorithms.

1 Approximating Jacobian

The training of a CRF can be formulated as solving Θ* = M(Θ*) by fixed-point iteration. Assume that at the t-th iteration, Θ^(t) is in the neighborhood of Θ* and the mapping M is differentiable. Then we can apply a linear Taylor expansion of M around Θ^(t), so that

Θ* = M(Θ*) ≈ M(Θ^(t)) + J(Θ* − Θ^(t)), where J := M′(Θ^(t)).

The multivariate Aitken's acceleration is given by

Θ^(t+1) = Θ^(t) + (I − J)^(−1) (M(Θ^(t)) − Θ^(t)),   (1)

where I is the identity matrix. An accurate estimate of J allows the extrapolation to reach Θ* directly. The componentwise triple jump extrapolation method [1] simplifies Aitken's acceleration by replacing J with a diagonal matrix diag(γ_p^(t)), where the scalar values γ_p^(t) approximate the eigenvalues of J and are defined by

γ_p^(t) := ([M(Θ^(t))]_p − θ_p^(t)) / (θ_p^(t) − θ_p^(t−1)),   (2)

where [M(Θ^(t))]_p is the p-th element of the output vector of M(Θ^(t)). This method is referred to as the triple jump method because the mapping M is applied twice, to obtain M(Θ^(t−1)) = Θ^(t) and then M(Θ^(t)), before Equation (2) is applied to make a large extrapolation in an attempt to reach the optimum. Although the triple jump method, like all variants of Aitken's acceleration, may not improve Θ monotonically, we can apply the idea proposed by [5] to guarantee convergence.
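As a concrete illustration, here is a minimal sketch of one componentwise triple-jump step, Equations (1) and (2), applied to an arbitrary fixed-point mapping. The function name `triple_jump_step` and the clipping threshold are assumptions for the sketch, not taken from the paper:

```python
import numpy as np

def triple_jump_step(M, theta):
    """One componentwise triple-jump step for theta* = M(theta*).

    M is any fixed-point mapping (for CRFs it would be the GIS mapping);
    theta plays the role of Theta^(t-1) in the paper's notation.
    """
    theta1 = M(theta)    # Theta^(t) = M(Theta^(t-1)): first application of M
    theta2 = M(theta1)   # M(Theta^(t)): second application of M
    # Equation (2): componentwise estimates of the eigenvalues of J.
    denom = theta1 - theta
    gamma = np.where(np.abs(denom) > 1e-12, (theta2 - theta1) / denom, 0.0)
    gamma = np.clip(gamma, -0.99, 0.99)  # keep 1/(1 - gamma) well defined
    # Equation (1) with J replaced by diag(gamma): extrapolate toward theta*.
    return theta1 + (theta2 - theta1) / (1.0 - gamma)
```

For a linear mapping M(Θ) = AΘ + b with diagonal A, a single step of this kind lands exactly on the fixed point (I − A)^(−1) b, which is what makes the extrapolation attractive when the underlying iteration contracts slowly.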
The idea is to discard the extrapolation whenever it fails to improve Θ and fall back to the estimate obtained without the extrapolation; in this way, convergence can be guaranteed. It is also possible to approximate J with a single scalar value, but [1, 2] show that for CRFs, when the features are independent, the componentwise extrapolation of Equation (2) should be preferred. Clearly, the time complexity of Equation (2) is O(n), where n is the dimension of Θ. In contrast, second-order algorithms usually require O(n²) computations or worse to ensure an accurate approximation of the Hessian.

2 Off-Line Algorithm: CTJPGIS

CTJPGIS is the abbreviation of "the componentwise triple jump method for penalized generalized iterative scaling." CTJPGIS is derived from the generalized iterative scaling (GIS) method [4], which by itself usually converges prohibitively slowly. Let D := {x_1, …, x_K} denote a set of K data sequences and {y_1, …, y_K} the corresponding labels. Training a CRF is searching for the weight vector Θ that maximizes the penalized log-likelihood function L(Θ; D) as the objective function. Usually, Gaussian priors are used to avoid overfitting. The penalized log-likelihood function is

L(Θ; D) = ℓ(Θ; D) − Σ_i (θ_i − μ)² / (2σ²) + const.,

where ℓ(Θ; D) is the log-likelihood function. The gradient along the direction of θ_i is

∇_i L(Θ; D) = Ẽf_i − Ef_i − (θ_i − μ)/σ²,

where Ẽf_i and Ef_i are the empirical and model expectations of f_i, respectively. Solving ∇_i L(Θ; D) = 0 yields the penalized GIS (PGIS) update:

θ_i^(t+1) = θ_i^(t) + (1/S) log( Ẽf_i / (Ef_i + (θ_i^(t) − μ)/σ²) ),   (3)

where S := max_k Σ_i f_i(y_k, x_k) is the maximum number of feature occurrences in a training sequence, as defined in [4]. Taking M(Θ) to be the right-hand side of Equation (3) and applying the extrapolation described in Equation (1), we obtain the CTJPGIS algorithm for training CRFs.
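A hedged sketch of the PGIS update of Equation (3) follows. The feature expectations Ẽf and Ef are taken as given (computing Ef over a CRF requires the forward–backward algorithm, which is outside this sketch), and the function name `pgis_update` and default prior parameters are assumptions:

```python
import numpy as np

def pgis_update(theta, e_emp, e_model, S, mu=0.0, sigma2=10.0):
    """One PGIS step, Equation (3):

        theta_i <- theta_i + (1/S) * log( E~f_i / (Ef_i + (theta_i - mu)/sigma^2) )

    e_emp, e_model: empirical and model expectations of the features f_i.
    Assumes every denominator entry is positive, as in standard GIS.
    """
    return theta + np.log(e_emp / (e_model + (theta - mu) / sigma2)) / S
```

CTJPGIS then treats this update as the mapping M and wraps it in the triple-jump extrapolation of Section 1, discarding any extrapolated point that fails to improve the objective.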
3 On-Line Algorithm: PSA

PSA is the abbreviation of "periodic stepsize adaptation" and is derived from stochastic gradient descent (SGD), which approximates the global objective function with small batches:
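The base iteration that PSA builds on can be sketched as plain minibatch SGD; the periodic per-dimension step-size adaptation that defines PSA is not shown here, and all names below are assumptions for the sketch:

```python
import numpy as np

def sgd(grad_batch, theta, data, batch_size=8, eta=0.1, epochs=20, seed=0):
    """Minibatch SGD: each step descends the gradient of the objective
    approximated on a small batch. PSA would additionally adapt eta
    (per dimension, at fixed periods) instead of keeping it constant."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    for _ in range(epochs):
        order = rng.permutation(len(data))       # reshuffle each epoch
        for start in range(0, len(data), batch_size):
            batch = data[order[start:start + batch_size]]
            theta = theta - eta * grad_batch(theta, batch)
    return theta
```

Each gradient evaluation touches only `batch_size` sequences, which is what makes the per-update cost independent of the size of the training set.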
Similar papers
Accelerating generalized Iterative Scaling Based on Staggered Aitken Method for on-Line Conditional Random Fields
In this paper, a convergent method based on Generalized Iterative Scaling (GIS) with staggered Aitken acceleration is proposed to estimate the parameters for an on-line Conditional Random Field (CRF). The staggered Aitken acceleration method, which alternates between the acceleration and non-acceleration steps, ensures computational simplicity when analyzing incomplete data. The proposed method...
Efficient Training of Conditional Random Fields
This thesis explores a number of parameter estimation techniques for conditional random fields, a recently introduced [31] probabilistic model for labelling and segmenting sequential data. Theoretical and practical disadvantages of the training techniques reported in current literature on CRFs are discussed. We hypothesise that general numerical optimisation techniques result in improved perfor...
Generalized Stacked Sequential Learning
In many supervised learning problems, it is assumed that data is independent and identically distributed. This assumption does not hold true in many real cases, where a neighboring pair of examples and their labels exhibit some kind of relationship. Sequential learning algorithms take benefit of these relationships in order to improve generalization. In the literature, there are different appro...
Genetic and Memetic Algorithms for Sequencing a New JIT Mixed-Model Assembly Line
This paper presents a new mathematical programming model for the bi-criteria mixed-model assembly line balancing problem in a just-in-time (JIT) production system. There is a set of criteria to judge sequences of the product mix in terms of the effective utilization of the system. The primary goal of this model is to minimize the setup cost and the stoppage assembly line cost, simultaneously. B...
Adaptive Stochastic Dual Coordinate Ascent for Conditional Random Fields
This work investigates training Conditional Random Fields (CRF) by Stochastic Dual Coordinate Ascent (SDCA). SDCA enjoys a linear convergence rate and a strong empirical performance for independent classification problems. However, it has never been used to train CRF. Yet it benefits from an exact line search with a single marginalization oracle call, unlike previous approaches. In this paper, ...
Publication date: 2007